RMSProp and equilibrated adaptive learning rates for non-convex optimization

Authors

Yann Dauphin

Harm de Vries

Junyoung Chung

Yoshua Bengio

Abstract

Parameter-specific adaptive learning rate methods are computationally efficient ways to reduce the ill-conditioning problems encountered when training large deep networks. Following recent work that strongly suggests that most of the critical points encountered when training such networks are saddle points, we find how considering the presence of negative eigenvalues of the Hessian could help us design better suited adaptive learning rate schemes, i.e., diagonal preconditioners. We show that the optimal preconditioner is based on taking the absolute value of the Hessian’s eigenvalues, which is not what Newton and classical preconditioners like Jacobi’s do. In this paper, we propose a novel adaptive learning rate scheme based on the equilibration preconditioner and show that RMSProp approximates it, which may explain some of its success in the presence of saddle points. Whereas RMSProp is a biased estimator of the equilibration preconditioner, the proposed stochastic estimator, ESGD, is unbiased and only adds a small percentage to computing time. We find that both schemes yield very similar step directions but that ESGD sometimes surpasses RMSProp in terms of convergence speed, always clearly improving over plain stochastic gradient descent.
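As a rough illustration of the estimator the abstract describes, the sketch below assumes the equilibration preconditioner is the vector of row norms of the Hessian, and that it can be estimated as sqrt(E[(Hv)^2]) with v drawn from a standard normal distribution. The function name `equilibration_estimate`, the `hvp` callback, and the small 3x3 matrix are hypothetical choices made for this example only; they are not taken from the paper's code, and a real network would use Hessian-vector products rather than an explicit matrix.

```python
import numpy as np


def equilibration_estimate(hvp, dim, num_samples, rng):
    """Monte Carlo estimate of sqrt(E[(H v)^2]) with v ~ N(0, I).

    `hvp` is a hypothetical callback returning the Hessian-vector product H @ v.
    """
    acc = np.zeros(dim)
    for _ in range(num_samples):
        v = rng.standard_normal(dim)
        acc += hvp(v) ** 2          # elementwise square of H v
    return np.sqrt(acc / num_samples)


rng = np.random.default_rng(0)

# Hypothetical symmetric Hessian with mixed-sign eigenvalues (a saddle point).
H = np.array([[4.0, 1.0, 0.0],
              [1.0, -3.0, 0.5],
              [0.0, 0.5, 0.1]])

# Exact equilibration diagonal under our assumption: the 2-norm of each row.
exact = np.linalg.norm(H, axis=1)

# Stochastic estimate that only needs Hessian-vector products.
approx = equilibration_estimate(lambda v: H @ v, dim=3, num_samples=20000, rng=rng)

# A preconditioned step would divide the gradient elementwise by this diagonal.
grad = np.array([1.0, -2.0, 0.5])   # made-up gradient, for illustration only
step = -0.01 * grad / approx
```

Because each entry of the diagonal is a norm, it stays positive even when the Hessian has negative eigenvalues, which is the property the abstract contrasts with Newton and Jacobi preconditioning.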

Related papers

Equilibrated adaptive learning rates for non-convex optimization

Variants of RMSProp and Adagrad with Logarithmic Regret Bounds

Adaptive gradient methods have become recently very popular, in particular as they have been shown to be useful in the training of deep neural networks. In this paper we have analyzed RMSProp, originally proposed for the training of deep neural networks, in the context of online convex optimization and show √T-type regret bounds. Moreover, we propose two variants SC-Adagrad and SC-RMSProp for...

Adaptive Learning Rate via Covariance Matrix Based Preconditioning for Deep Neural Networks

Adaptive learning rate algorithms such as RMSProp are widely used for training deep neural networks. RMSProp offers efficient training since it uses first order gradients to approximate Hessian-based preconditioning. However, since the first order gradients include noise caused by stochastic optimization, the approximation may be inaccurate. In this paper, we propose a novel adaptive learning ra...

Convergence Rate of Sign Stochastic Gradient Descent for Non-convex Functions

The sign stochastic gradient descent method (signSGD) utilises only the sign of the stochastic gradient in its updates. For deep networks, this one-bit quantisation has surprisingly little impact on convergence speed or generalisation performance compared to SGD. Since signSGD is effectively compressing the gradients, it is very relevant for distributed optimisation where gradients need to be a...

The Marginal Value of Adaptive Gradient Methods in Machine Learning

Adaptive optimization methods, which perform local optimization with a metric constructed from the history of iterates, are becoming increasingly popular for training deep neural networks. Examples include AdaGrad, RMSProp, and Adam. We show that for simple overparameterized problems, adaptive methods often find drastically different solutions than gradient descent (GD) or stochastic gradient d...

Journal: CoRR

Volume: abs/1502.04390

Publication date: 2015
